Bib numbers are unique identifiers worn by each runner before, during, and after the race. In some races, like the Boston Marathon, bib numbers are given out in batches and used to organize the waves in which runners start. To make the start of a 26,000-person race more organized, the 2017 Boston Marathon broke the runners into four color-coded groups. To determine which group (or wave) a runner would be in, the marathon organizers used previously submitted qualifying times, as detailed below. (http://registration.baa.org/2017/cf/Public/iframe_EntryLists.cfm)
- Red bibs (numbers 101 to 7,700) are assigned to Wave 1 (starting at 10:00 a.m.).
- White bibs (numbers 8,000 to 15,600) are assigned to Wave 2 (starting at 10:25 a.m.).
- Blue bibs (numbers 16,000 to 23,600) are assigned to Wave 3 (starting at 10:50 a.m.).
- Yellow bibs (numbers 24,000 to 32,500) are assigned to Wave 4 (starting at 11:15 a.m.).

The break between Wave 1 and Wave 2 is a 3:10:43 marathon qualifying time. The break between Wave 2 and Wave 3 is a 3:29:27 marathon qualifying time. The break between Wave 3 and Wave 4 is a 3:57:18 marathon qualifying time.
The question at hand is: can we develop an unsupervised clustering model that accurately identifies these groupings without using the information in the paragraph above? An additional question is whether we can confirm that the fourth group also includes runners who did not have to qualify for the marathon but are instead running for a charity group.
We first need to convert the bib number from a factor to an integer. We have to convert the factor to a character first, because directly converting a factor to an integer returns the underlying factor level, not the integer the factor may represent.
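A minimal sketch of that conversion, using a toy factor in place of the real bib column (the column's actual name in `bm_2017` is not shown here, so the toy vector is a stand-in):

```r
# Toy stand-in for the bib column, which is read in as a factor.
bibs <- factor(c("101", "7700", "24000"))

# Converting a factor directly returns the internal level codes, not the bibs.
as.integer(bibs)                # 1 3 2 (level codes)

# Going through character first recovers the printed numbers.
bib_int <- as.integer(as.character(bibs))
bib_int                         # 101 7700 24000
```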
Next, we can plot the finishing time against bib number and start to see several trends. First, finishing times slowly yet steadily increase, supporting the theory that faster finishers get lower bib numbers. Second, there are about four observable clusters, which match the waves organized by the Boston Marathon at the start. Finally, the last group has much more variance within it, and far slower average finishing times. These are likely the bib numbers of charity runners and other runners who did not need to qualify for the race.
## Warning: NAs introduced by coercion
This plot is rather dense, so let's use a density scatterplot to better see the distribution of the data.
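One way to draw such a plot, assuming ggplot2 is available and that the converted bib and finishing-time columns are named `Bib_int` and `Official.Time.Min` (names taken from the model output later in this report):

```r
library(ggplot2)

# Bin the (bib, time) plane and color by count so dense regions stand out.
ggplot(bm_2017, aes(x = Bib_int, y = Official.Time.Min)) +
  geom_bin2d(bins = 100) +
  scale_fill_viridis_c(name = "Runners") +
  labs(x = "Bib number", y = "Finishing time (minutes)")
```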
From the histogram below, we can see that there are some gaps from “unused” bib numbers or runners who did not finish the race. The gaps of unused bib numbers correspond to the breaks between the wave groups, which has the added advantage of sharpening the edges of our clusters and hopefully making it easier for our model to identify the correct groupings.
Now let's label the data with the correct group names so we can compare our model's output against the known waves.
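Given the published bib-number ranges, the labeling can be done with `cut()`; a self-contained sketch:

```r
# Wave boundaries from the published bib ranges; the last break sits just
# past the final yellow bib so 32,500 is included.
wave_breaks <- c(101, 8000, 16000, 24000, 32501)
wave_names  <- c("Wave 1 (Red)", "Wave 2 (White)",
                 "Wave 3 (Blue)", "Wave 4 (Yellow)")

label_wave <- function(bib) {
  cut(bib, breaks = wave_breaks, labels = wave_names,
      right = FALSE, include.lowest = TRUE)
}

label_wave(c(500, 8000, 23600, 32000))
```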
Red bibs (numbers 101 to 7,700) are assigned to Wave 1 (starting at 10:00 a.m.). White bibs (numbers 8,000 to 15,600) are assigned to Wave 2 (starting at 10:25 a.m.). Blue bibs (numbers 16,000 to 23,600) are assigned to Wave 3 (starting at 10:50 a.m.) Yellow bibs (numbers 24,000 to 32,500) are assigned to Wave 4 (starting at 11:15 a.m.). The break between Wave 1 and Wave 2 is a 3:10:43 marathon qualifying time. The break between Wave 2 and Wave 3 is a 3:29:27 marathon qualifying time. The break between Wave 3 and Wave 4 is a 3:57:18 marathon qualifying time.
We can try K-means clustering to see if the algorithm can successfully identify the known clusters. Since there are four known clusters, we will provide 4 as the number of centers for the K-means algorithm. Additionally, since K-means starts from a random assignment, we will set the random seed to one and run K-means 20 times, keeping the run with the lowest total within-cluster sum of squares.
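A self-contained sketch of that setup on toy one-dimensional data (four well-separated groups standing in for the bib numbers); `nstart = 20` reruns the algorithm from 20 random starts and keeps the best solution:

```r
set.seed(1)
# Four well-separated groups of 50 points each.
x <- c(rnorm(50, 10), rnorm(50, 50), rnorm(50, 90), rnorm(50, 130))

km <- kmeans(matrix(x, ncol = 1), centers = 4, nstart = 20)
sort(km$size)   # 50 50 50 50: each group recovered
```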
As you can see in the above graph, K-means identifies all four clusters and their break points perfectly. This model works well here in part because of the clean breaks between the four clusters.
Because of the memory required to process hierarchical clustering with over 26,000 observations, that processing has been moved to a separate Rmd.
From our analysis so far, K-means clustering appears to be the best model for this problem set. Because it is a top-down approach that fits the data into the number of clusters provided as an input to the model, it does an effective job of finding the break points in this data. The hierarchical approach, on the other hand, is less effective at identifying these break points because each successive level of hierarchical clustering merges the next-nearest nodes rather than taking the entire dataset into account.
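For reference, a hierarchical run on a small sample looks like the following (the full 26,000-observation version lives in the separate Rmd); the sample size and linkage method here are illustrative choices:

```r
set.seed(1)
# Small sample: dist() builds an n-by-n distance matrix, which is exactly
# why the full dataset is memory-hungry.
x <- c(rnorm(25, 10), rnorm(25, 50), rnorm(25, 90), rnorm(25, 130))

hc     <- hclust(dist(x), method = "complete")
groups <- cutree(hc, k = 4)   # cut the tree into four clusters
table(groups)
```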
Because we know from our previous analysis that the fourth-quarter rank contributes the most to the final rank, we modeled how people's previous performance, represented by their bib number, their age, and their gender affect their fourth-quarter rank.
##
## Call:
## lm(formula = X30.42K.Rank ~ Bib_int + M.F + Age, data = bm_2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22169.7 -4411.6 -505.7 3910.6 21603.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.663e+03 1.435e+02 11.59 <2e-16 ***
## Bib_int 5.572e-01 4.417e-03 126.15 <2e-16 ***
## M.FM 1.418e+03 8.076e+01 17.56 <2e-16 ***
## Age 4.557e+01 3.338e+00 13.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5728 on 26360 degrees of freedom
## Multiple R-squared: 0.4336, Adjusted R-squared: 0.4335
## F-statistic: 6727 on 3 and 26360 DF, p-value: < 2.2e-16
## Bib_int M.FM Age
## 1.300902 1.298240 1.167188
In a model predicting the fourth quarter rank by bib number, gender, and age, the intercept is 1.66e+03 and is significant (p<.001). All predictors (bib number, gender, and age) significantly predict fourth quarter rank (p<0.001). A one-unit increase in bib number positively impacts (+0.557) fourth quarter rank (p<0.001), meaning a higher bib number (slower past performance) predicts a higher fourth quarter rank (slower fourth quarter performance). Being male (vs. female) positively impacts (+1,418) fourth quarter rank (p<0.001). A one-unit increase in age positively impacts (+45.6) fourth quarter rank (p<0.001), meaning older runners will likely have a higher fourth quarter rank (slower fourth quarter performance). R2 (adjusted) is .434, which means that 43.4% of the variability in fourth quarter rank is explained by bib number, gender, and age. The VIFs are 1.17–1.30, indicating low multicollinearity.
We also examined rank in each quarter of the race, age, and gender to see their effects on final official time and on overall rank.
##
## Call:
## lm(formula = Official.Time.Min ~ X0.10K.Rank + X10.20K.Rank +
## X20.30K.Rank + X30.42K.Rank + M.F + Age, data = bm_2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -106.555 -7.703 -1.614 5.424 163.943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.671e+02 2.880e-01 580.19 <2e-16 ***
## X0.10K.Rank 1.168e-03 3.054e-05 38.23 <2e-16 ***
## X10.20K.Rank 1.118e-03 4.596e-05 24.32 <2e-16 ***
## X20.30K.Rank 1.256e-03 3.983e-05 31.52 <2e-16 ***
## X30.42K.Rank 2.217e-03 2.222e-05 99.79 <2e-16 ***
## M.FM 3.369e+00 1.696e-01 19.86 <2e-16 ***
## Age -1.568e-01 6.934e-03 -22.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.61 on 26357 degrees of freedom
## Multiple R-squared: 0.9238, Adjusted R-squared: 0.9238
## F-statistic: 5.329e+04 on 6 and 26357 DF, p-value: < 2.2e-16
## X0.10K.Rank X10.20K.Rank X20.30K.Rank X30.42K.Rank M.FM
## 10.564164 23.925850 17.968686 5.592565 1.393928
## Age
## 1.225789
##
## Call:
## lm(formula = Overall ~ X0.10K.Rank + X10.20K.Rank + X20.30K.Rank +
## X30.42K.Rank + M.F + Age, data = bm_2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21645.5 -215.4 -53.2 172.4 14471.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.086e+02 1.851e+01 -32.874 < 2e-16 ***
## X0.10K.Rank 1.859e-01 1.963e-03 94.720 < 2e-16 ***
## X10.20K.Rank 1.745e-01 2.954e-03 59.049 < 2e-16 ***
## X20.30K.Rank 2.902e-01 2.560e-03 113.348 < 2e-16 ***
## X30.42K.Rank 4.014e-01 1.428e-03 281.027 < 2e-16 ***
## M.FM -3.653e+01 1.090e+01 -3.350 0.000808 ***
## Age -2.702e-01 4.457e-01 -0.606 0.544320
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 746.4 on 26357 degrees of freedom
## Multiple R-squared: 0.9904, Adjusted R-squared: 0.9904
## F-statistic: 4.527e+05 on 6 and 26357 DF, p-value: < 2.2e-16
## X0.10K.Rank X10.20K.Rank X20.30K.Rank X30.42K.Rank M.FM
## 10.564164 23.925850 17.968686 5.592565 1.393928
## Age
## 1.225789
Ranks in each quarter of the race, gender, and age all have a significant effect on official finishing time. However, the VIFs of the 0–10k, 10–20k, and 20–30k ranks show that these independent variables are highly correlated. Considering our previous finding that fourth-quarter performance contributes the most to the final rank, we rebuilt the model after dropping the first three quarters' ranks.
##
## Call:
## lm(formula = Official.Time.Min ~ X30.42K.Rank + M.F + Age, data = bm_2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -124.170 -9.890 0.074 8.129 180.538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.727e+02 4.250e-01 406.34 <2e-16 ***
## X30.42K.Rank 4.833e-03 1.465e-05 329.89 <2e-16 ***
## M.FM -1.070e+01 2.229e-01 -48.01 <2e-16 ***
## Age 1.792e-01 9.881e-03 18.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.25 on 26360 degrees of freedom
## Multiple R-squared: 0.8318, Adjusted R-squared: 0.8318
## F-statistic: 4.346e+04 on 3 and 26360 DF, p-value: < 2.2e-16
## X30.42K.Rank M.FM Age
## 1.100936 1.090192 1.127243
In a model predicting the official time by fourth quarter rank, gender, and age, the intercept is 173 and is significant (p<.001). All predictors (fourth quarter rank, gender, and age) significantly predict official time (p<0.001). R2 (adjusted) is .832, which means that 83.2% of the variability in the official time is explained by fourth quarter rank, gender, and age. The VIFs are 1.09–1.13, indicating low multicollinearity.
One unit increase in fourth quarter rank positively impacts (+0.0048) official time. Being male (vs. female) negatively impacts (-10.70) official time. One unit increase in age positively impacts (0.1792) official time.
##
## Call:
## lm(formula = Overall ~ X30.42K.Rank + M.F + Age, data = bm_2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23503.5 -1251.2 118.5 1394.2 18545.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.733e+02 5.884e+01 6.345 2.27e-10 ***
## X30.42K.Rank 8.916e-01 2.028e-03 439.646 < 2e-16 ***
## M.FM -2.521e+03 3.086e+01 -81.699 < 2e-16 ***
## Age 5.824e+01 1.368e+00 42.578 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2389 on 26360 degrees of freedom
## Multiple R-squared: 0.9016, Adjusted R-squared: 0.9016
## F-statistic: 8.049e+04 on 3 and 26360 DF, p-value: < 2.2e-16
## X30.42K.Rank M.FM Age
## 1.100936 1.090192 1.127243
In the improved overall-rank model, all predictors are now significant (p<0.001), and the VIFs are between 1.09 and 1.13, indicating low multicollinearity. R2 (adjusted) is .9016, which means that 90.16% of the variability in overall rank is explained by fourth quarter rank, gender, and age. So the model has been improved in an appropriate way, and fourth quarter rank, gender, and age all contribute to overall rank.
One unit increase in fourth quarter rank positively impacts (+0.8916) overall rank. Being male (vs. female) negatively impacts (-2521) overall rank. One unit increase in age positively impacts (58.24) overall rank.
From the following two plots, we can see that the distributions of overall rank and official time are related to the fourth quarter rank, gender, and age.
We also explored whether we could predict participants' ranks in four categories given their age, gender, and fourth quarter rank. We classified official time into four levels by order of official time.
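One way to build those four levels in base R, shown on toy finishing times (the real code would cut the official-time column the same way):

```r
# Toy finishing times in minutes.
times <- c(130, 150, 165, 180, 195, 210, 240, 300)

# Cut at the quartiles so each level holds ~25% of runners.
levels4 <- cut(times,
               breaks = quantile(times, probs = seq(0, 1, 0.25)),
               labels = c("1", "2", "3", "4"),
               include.lowest = TRUE)
table(levels4)   # two runners per level here
```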
Then we ran the KNN test.
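The KNN step can be sketched with `class::knn()` on toy standardized features (the feature names and sample sizes here are illustrative, not the report's):

```r
library(class)
set.seed(1)

# Toy features standing in for scaled age, gender, and Q4 rank.
n <- 200
X <- data.frame(f1 = rnorm(n), f2 = rnorm(n), f3 = rnorm(n))
y <- factor(ifelse(X$f1 + X$f3 > 0, "upper", "lower"))

train_idx <- sample(n, 140)
pred <- knn(train = X[train_idx, ], test = X[-train_idx, ],
            cl = y[train_idx], k = 12)
mean(pred == y[-train_idx])   # holdout accuracy
```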
It seems 12-nearest neighbors (with an accuracy rate of 72.5%) is a decent choice, because that is the greatest improvement in predictive accuracy before the incremental improvement trails off. So we can predict a participant's level from their age, fourth quarter rank, and gender. The cross table of actual vs. predicted is the following:
Let us predict the Official time of the marathon by using all the predictor variables in the data.
##
## Call:
## lm(formula = Official.Time ~ Age + M.F + X5K + X10K + X15K +
## X20K + Half + X25K + X30K + X35K + X40K, data = splits)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.307 -0.601 -0.154 0.389 121.528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.891228 0.097179 9.171 < 2e-16 ***
## Age 0.014733 0.001206 12.215 < 2e-16 ***
## M.FM 0.208143 0.028784 7.231 4.92e-13 ***
## X5K -0.025793 0.026224 -0.984 0.325347
## X10K 0.024532 0.027701 0.886 0.375839
## X15K 0.023922 0.018976 1.261 0.207449
## X20K -0.165660 0.039651 -4.178 2.95e-05 ***
## Half 0.168417 0.039072 4.310 1.64e-05 ***
## X25K -0.043706 0.011515 -3.796 0.000148 ***
## X30K -0.008966 0.008662 -1.035 0.300603
## X35K -0.216029 0.006934 -31.154 < 2e-16 ***
## X40K 1.253935 0.003514 356.793 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.042 on 26200 degrees of freedom
## Multiple R-squared: 0.9976, Adjusted R-squared: 0.9976
## F-statistic: 1.007e+06 on 11 and 26200 DF, p-value: < 2.2e-16
## Age M.FM X5K X10K X15K X20K
## 1.193197 1.290768 68.169149 309.461352 343.112512 2931.789934
## Half X25K X30K X35K X40K
## 3187.889602 425.862092 379.551362 356.622865 123.269791
In the model predicting the official time of the marathon using age, gender, and all the split times, we can see that not all the coefficients are significant: only the intercept, age, gender, and some of the split times are. R2 (adjusted) is .99, which is almost 1. This is because we have used all the split times up to 40K, so the model explains over 99% of the variability in the official time using all the split times, gender, and age. But the main problem is multicollinearity in the data: the VIF values of the split times are well above 10, which suggests a lot of multicollinearity. We will overcome this by using PCA and PCR. Before that, let's predict the official time of the marathon using age, gender, and the half-marathon split.
##
## Call:
## lm(formula = Official.Time ~ Age + M.F + Half, data = splits)
##
## Residuals:
## Min 1Q Median 3Q Max
## -129.710 -8.495 -2.382 5.639 192.668
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.599737 0.587650 -19.739 <2e-16 ***
## Age -0.015724 0.007779 -2.021 0.0432 *
## M.FM 5.965982 0.183971 32.429 <2e-16 ***
## Half 2.235712 0.005038 443.790 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.37 on 26208 degrees of freedom
## Multiple R-squared: 0.8988, Adjusted R-squared: 0.8988
## F-statistic: 7.758e+04 on 3 and 26208 DF, p-value: < 2.2e-16
## Age M.FM Half
## 1.157320 1.229639 1.235855
In this model, we predict the official time of the marathon with age, gender, and the half-marathon split. All the coefficients are statistically significant. R2 (adjusted) is .8988, meaning the model explains about 90% of the variability in official time with age, gender, and the half split. Also, the VIF values are 1.16, 1.23, and 1.24, which indicates that there is no multicollinearity.
Let us see whether we can apply Principal Component Analysis and Regression to our split times.
Principal component analysis of the split times suggests that the first principal component, PC1, covers 97% of the variance in the data; PC1 and PC2 together cover 99%. We can observe this in the above plot. Let us perform regression analysis based on the principal components, i.e., principal component regression (PCR). We can validate the R2 based on the number of components used.
## Data: X dimension: 26212 9
## Y dimension: 26212 1
## Fit method: svdpc
## Number of components considered: 9
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 42.02 11.58 5.578 3.384 2.701 2.322 2.104
## adjCV 42.02 11.58 5.578 3.384 2.701 2.321 2.103
## 7 comps 8 comps 9 comps
## CV 2.100 2.057 2.056
## adjCV 2.099 2.056 2.056
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## X 97.00 99.32 99.78 99.89 99.94 99.96
## Official.Time 92.41 98.24 99.35 99.59 99.70 99.75
## 7 comps 8 comps 9 comps
## X 99.98 100.00 100.00
## Official.Time 99.75 99.76 99.76
In the principal component regression, 1 component covers 97% of the variance in the split times and 92% of the variance in the official time, whereas 2 components cover 99% of the variance in the split times and 98% of the variance in the official time. The R2 value is about .92 if we consider 1 component and about .98 if we consider 2 components, i.e., 2 principal components cover 98% of the variance in the official time.
Let’s drop columns that are not required and handle missing values in the dataset
There are 26259 rows after dropping the columns and removing NA and blank values.
Let's calculate the pace of the runners between each time split. Pace is time taken per km: if your pace is 5' 20" (with measurements set to metric), it means you take 5 minutes and 20 seconds to run each kilometer. Lower values mean you are traveling faster. For the 2017 Boston Marathon, the total distance covered by a runner is 42.195 km, so by the half split a runner has covered 21.1 km.
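The pace computation itself is just the segment time divided by the segment length; a toy example with assumed split times (in minutes):

```r
# Cumulative split times in minutes at 5 km and 10 km for three runners.
x5k  <- c(25.0, 20.0, 30.0)
x10k <- c(51.0, 41.5, 61.0)

# Segment pace in minutes per km over the 5k-10k stretch (5 km long).
pace_5k_10k <- (x10k - x5k) / 5
pace_5k_10k   # 5.2 4.3 6.2
```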
Let's subset the age and pace columns up to the half split from the dataset; these will be the features for our models, and we will be predicting the official time.
Let's split the dataset into a 70% train set and a 30% test set, setting the seed to 1 so that the split is reproducible.
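A base-R sketch of a reproducible 70/30 split, with a toy data frame standing in for the feature table:

```r
set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100))  # stand-in features

# Sample 70% of the row indices for training; the rest form the test set.
train_idx <- sample(nrow(df), size = floor(0.7 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

c(nrow(train), nrow(test))   # 70 30
```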
Let’s plot correlation matrix of training data.
We can observe that there exists a multi-collinearity between the pace variables and also the pace variables are strongly correlated with Official Time.
Unlike ordinary least squares regression, ridge regression is highly affected by the scale of the predictors. Therefore, it is better to standardize (i.e., scale) the predictors before applying ridge and lasso regression, so that all the predictors are on the same scale. So let's scale X_train, X_test, y_train, and y_test.
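When scaling, one detail worth noting: the test set should be scaled with the training set's center and spread, not its own. A sketch with toy matrices:

```r
set.seed(1)
X_train <- matrix(rnorm(60, mean = 5, sd = 2), ncol = 3)
X_test  <- matrix(rnorm(30, mean = 5, sd = 2), ncol = 3)

X_train_s <- scale(X_train)   # center/scale computed from the training data
X_test_s  <- scale(X_test,
                   center = attr(X_train_s, "scaled:center"),
                   scale  = attr(X_train_s, "scaled:scale"))

round(colMeans(X_train_s), 10)   # ~0 for every training column
```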
Let’s build an Ordinary Least Squares model using our train data and predict official time with test data.
##
## Call:
## lm(formula = Official.Time.Min ~ ., data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -204.932 -7.252 -2.160 4.780 120.507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.796316 0.676113 -2.657 0.00789 **
## Age 0.022852 0.008584 2.662 0.00777 **
## M.FM 4.576376 0.200729 22.799 < 2e-16 ***
## Pace0k.5k 6.881673 0.410370 16.769 < 2e-16 ***
## Pace5k.10k -0.533937 0.594195 -0.899 0.36888
## Pace10k.15k 11.597146 0.493059 23.521 < 2e-16 ***
## Pace15k.20k 14.046695 0.278331 50.468 < 2e-16 ***
## Pace20k.Half 12.522291 0.255354 49.039 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.2 on 18373 degrees of freedom
## Multiple R-squared: 0.9169, Adjusted R-squared: 0.9168
## F-statistic: 2.894e+04 on 7 and 18373 DF, p-value: < 2.2e-16
## Age M.FM Pace0k.5k Pace5k.10k Pace10k.15k
## 1.182851 1.232578 13.203802 29.891158 23.848568
## Pace15k.20k Pace20k.Half
## 10.475560 7.854640
## R2 RMSE MAE
## 1 0.9179003 11.94196 8.264375
About 91.8% of the variability in official time is explained by the predictor variables. Pace between 15k and 20k and pace between 20k and 21.1k (the half split) are the top two features in determining the official time. We can also observe from the VIF values that there is multicollinearity between the independent variables, so let's build a model that helps solve the multicollinearity problem.
Ridge regression shrinks the coefficients of the independent variables to mitigate multicollinearity. We need to choose the regularization parameter lambda, which controls the amount of coefficient shrinkage. The best lambda for the data can be defined as the lambda that minimizes the cross-validated prediction error; this can be determined using the function cv.glmnet().
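Assuming the glmnet package, the ridge fit and lambda search look like this (toy data here; in the report the inputs would be the scaled pace/age/gender matrix):

```r
library(glmnet)
set.seed(1)

# Toy regression problem.
n <- 200
X <- matrix(rnorm(n * 5), n, 5)
y <- X %*% c(1, 2, 0, 0, 3) + rnorm(n)

# alpha = 0 selects the ridge penalty; cv.glmnet picks lambda by 10-fold CV.
cv_ridge <- cv.glmnet(X, y, alpha = 0)
cv_ridge$lambda.min                       # lambda minimizing CV error
pred <- predict(cv_ridge, newx = X, s = "lambda.min")
```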
The first vertical dotted line is where the lowest MSE is; the second vertical dotted line is within one standard error of it. The labels above the graph show how many non-zero coefficients are in the model. The best lambda value found here is 0.0934576. Let's predict the official time for the test data using the best lambda value.
## [1] 0.08550857
## MAE RMSE X1
## 1 0.2064382 0.2924185 0.9122858
## Age M.FM Pace0k.5k Pace5k.10k Pace10k.15k
## 0.03490199 3.84633811 6.00007979 5.85997084 9.56233432
## Pace15k.20k Pace20k.Half
## 11.56372032 11.26372657
We got an r-square value of about 91.2% using ridge regression.
Lasso regression also mitigates multicollinearity between variables, by shrinking some coefficients all the way to zero (variable selection). The same function, glmnet(), with alpha set to 1 builds the lasso regression model.
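The corresponding lasso sketch, again on toy data and assuming the glmnet package; with alpha = 1 some coefficients are shrunk exactly to zero:

```r
library(glmnet)
set.seed(1)

n <- 200
X <- matrix(rnorm(n * 5), n, 5)
y <- X %*% c(1, 2, 0, 0, 3) + rnorm(n)

cv_lasso <- cv.glmnet(X, y, alpha = 1)    # alpha = 1 -> lasso penalty
coef(cv_lasso, s = "lambda.min")          # weak predictors drop to zero
```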
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to
## unique 'x' values
Here we see that the lowest MSE is when \(\lambda\) is approximately 0.00465, with 5 non-zero coefficients.
## [1] 0.08211911
## MAE RMSE X1
## 1 0.2064382 0.2924185 0.9122858
## Age M.FM Pace0k.5k Pace5k.10k Pace10k.15k
## 0.0214768 4.5631016 6.6009531 0.0000000 11.3663131
## Pace15k.20k Pace20k.Half
## 14.0402185 12.5053273
## Age M.FM Pace0k.5k Pace10k.15k Pace15k.20k
## 0.0214768 4.5631016 6.6009531 11.3663131 14.0402185
## Pace20k.Half
## 12.5053273
The main problem with lasso regression is that when we have correlated variables, it tends to retain only one of them and shrink the other correlated variables' coefficients to zero. That can lead to some loss of information, resulting in lower accuracy in our model.
##
## Regression tree:
## rpart(formula = Official.Time.Min ~ ., data = train1, method = "anova")
##
## Variables actually used in tree construction:
## [1] Pace10k.15k Pace15k.20k Pace20k.Half
##
## Root node error: 32913426/18381 = 1790.6
##
## n= 18381
##
## CP nsplit rel error xerror xstd
## 1 0.595813 0 1.00000 1.00008 0.0111647
## 2 0.128601 1 0.40419 0.40571 0.0046479
## 3 0.096503 2 0.27559 0.28004 0.0041649
## 4 0.022903 3 0.17908 0.18117 0.0028424
## 5 0.018829 4 0.15618 0.16345 0.0027726
## 6 0.015728 5 0.13735 0.14101 0.0026243
## 7 0.011889 6 0.12162 0.12213 0.0024654
## 8 0.010000 7 0.10973 0.11272 0.0023320
## R2 RMSE MAE
## 1 0.8863439 14.05541 10.19628
We got an r-square value of about 88.6% using a decision tree. The main disadvantage of a decision tree is that with sufficient depth we get low-bias, high-variance predictions. High variance means the decision tree model changes a lot with changes in the training data, resulting in changes in accuracy. To control the variance, we use a technique called bagging.
In bagging, we take an ensemble of models that individually have low bias and high variance as the base model. By ensembling them, we get predictions with low bias and lower variance. Below we build an ensemble of decision trees using the treebag method.
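A sketch of the treebag fit via caret (assuming the caret and ipred packages are available; `train1` follows the naming in the model output, while the held-out `test1` is an assumption):

```r
library(caret)
set.seed(1)

# Bagged regression trees: many trees fit on bootstrap resamples, with
# their predictions averaged.
fit_bag  <- train(Official.Time.Min ~ ., data = train1, method = "treebag")
pred_bag <- predict(fit_bag, newdata = test1)
```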
## R2 RMSE MAE
## 1 0.9005626 13.14366 9.290779
We got an r-square value of about 90.1% using a bagged tree. The main disadvantage of bagging is that the predictions from the individual decision trees are highly correlated. To reduce the correlation between trees, we use a random forest model.
Random Forest is an extension of Bagging where we subset the features along with bootstrap of rows with replacement.
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 18381 -none- numeric
## mse 100 -none- numeric
## rsq 100 -none- numeric
## oob.times 18381 -none- numeric
## importance 7 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 18381 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
##
## Call:
## randomForest(formula = Official.Time.Min ~ ., data = train1, ntree = 100)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 127.8989
## % Var explained: 92.86
## R2 RMSE MAE
## 1 0.9005626 13.14366 9.290779
We got an r-square value of about 90.1% using random forest.
Boosting is another approach for improving the predictions resulting from Decision Tree. In Gradient Boosting, we build ensemble of decision trees sequentially and the predictions of individual trees are summed sequentially. Every decision tree tries to recover the loss (difference between actual and predicted values) by fitting the tree on residuals of the previous tree. This results in model with better accuracy.
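Assuming the gbm package, a sketch of that sequential fit (the tuning values here are illustrative choices, not the report's):

```r
library(gbm)
set.seed(1)

# Each tree is fit to the residuals of the ensemble so far; shrinkage
# controls how much of each tree's correction is added.
fit_gbm <- gbm(Official.Time.Min ~ ., data = train1,
               distribution = "gaussian",
               n.trees = 500, interaction.depth = 3, shrinkage = 0.05)
summary(fit_gbm)   # relative influence of each predictor
```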
We got an r-square value of about 91.5% using gradient boosting. We can see that pace between 15k and 20k and pace between 20k and 21.1k are by far the most important variables in our gbm model.
Extreme Gradient Boosting or xgboost gives better approximations of predictions over gradient boosting and computationally it is fast in training the data.
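Assuming the xgboost package, a sketch of the fit; xgboost takes a numeric matrix, so factors such as gender are expanded into dummy columns first (`nrounds` is an illustrative value):

```r
library(xgboost)

# Expand factors into dummy columns; "- 1" drops the intercept column.
X <- model.matrix(Official.Time.Min ~ . - 1, data = train1)

fit_xgb  <- xgboost(data = X, label = train1$Official.Time.Min,
                    nrounds = 200, objective = "reg:squarederror",
                    verbose = 0)
pred_xgb <- predict(fit_xgb, newdata = X)   # in-sample predictions
```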
We got an r-square value of about 92.5% using extreme gradient boosting.
In our previous data exploration, we conducted extensive analysis of how a runner’s hometown affected finishing time. We found strong differences among home states and home countries, with faster runners coming from African countries like Ethiopia and Kenya and slower runners coming from New England states near the start of the marathon.
We were curious to see if the distance traveled had a more direct relationship with finishing time. To answer this question, we had to do some feature engineering. In data science, we rarely have all the data needed and frequently need to build our own features from aspects of the existing data.
To attack this problem, we leveraged work from a previous class using Python to pass place names to Google Maps’ geocoding API, which returns locational metadata including latitude and longitude. To determine the distance between a runner’s provided hometown and the race start, we built a custom function to calculate the distance using the haversine formula. All details are included in the attached Jupyter Notebook and HTML files. Once the calculations were complete, the distance traveled for each runner was merged back into our original dataset using bib number as a unique identifier.
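The original computation lives in the Python notebook, but the haversine formula itself is easy to sketch in R (radius in km; the coordinates in the example are approximate):

```r
# Great-circle distance in km between two points given in degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(pmin(1, a)))   # clamp guards against rounding error
}

# Approximate start line (Hopkinton, MA) to downtown Boston: ~40 km.
haversine_km(42.2287, -71.5226, 42.3601, -71.0589)
```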
First we will build a linear model, and then use that model to predict the finishing time for each runner as a function of the distance traveled from their hometown. Once we build the model, we can plot the actual data with the predicted data as a line. As you can see from the plot below, there appears to be a very weak negative correlation between distance traveled and finishing time. Let’s examine the model output for more details.
We can further analyze the model by looking at specific outputs. The R-squared value of this model is 0.00713. The p-value for the distance term is effectively zero. The coefficient for the distance variable is -0.00203. The VIF is 1.
These outputs confirm what we saw in the plot. For every additional mile between a runner’s hometown and the race, the runner is expected to finish 0.00203 minutes faster. However, we can also see that this model is a very poor fit for the data, with very large residuals.
This is an excellent example that even detailed feature engineering and a very small p-value do not guarantee a good model. Distance traveled is statistically significant, but its adjusted R-squared is essentially zero (about 0.007), so the model has little predictive power, as the obviously large residuals confirm.